Exploring Burglaries Patterns in Chicago and Predictive Modeling with 311 Data
Author
Yuqing(Demi) Yang
Published
December 7, 2025
Assignment Overview
In this lab, you will apply the spatial predictive modeling techniques demonstrated in the class exercise to a 311 service request type of your choice. You will build a complete spatial predictive model, document your process, and interpret your results.
Timeline & Deliverables
Due Date: November 17, 2025, 10:00AM
Deliverable: One rendered document, posted to your portfolio website.
Learning Goals
By completing this assignment, you will demonstrate your ability to:
Adapt example code to analyze a new dataset
Build spatial features for predictive modeling
Apply count regression techniques to spatial data
Implement spatial cross-validation
Interpret and communicate model results
Critically evaluate model performance
Step 0: Set Up
Code
# Load required packageslibrary(tidyverse) # Data manipulationlibrary(sf) # Spatial operationslibrary(here) # Relative file pathslibrary(viridis) # Color scaleslibrary(terra) # Raster operations (replaces 'raster')library(spdep) # Spatial dependencelibrary(FNN) # Fast nearest neighborslibrary(MASS) # Negative binomial regressionlibrary(patchwork) # Plot composition (replaces grid/gridExtra)library(knitr) # Tableslibrary(kableExtra) # Table formattinglibrary(classInt) # Classification intervalslibrary(here)# Spatstat split into sub-packageslibrary(spatstat.geom) # Spatial geometrieslibrary(spatstat.explore) # Spatial exploration/KDE# Set optionsoptions(scipen =999) # No scientific notationset.seed(5080) # Reproducibility# Create consistent theme for visualizationstheme_default <-function(base_size =11) {theme_minimal(base_size = base_size) +theme(plot.title =element_text(face ="bold", size = base_size +1),plot.subtitle =element_text(color ="gray30", size = base_size -1),legend.position ="right",panel.grid.minor =element_blank(),axis.text =element_blank(),axis.title =element_blank() )}# Set as defaulttheme_set(theme_default())cat("✓ All packages loaded successfully!\n")
# Load from provided data file (downloaded from Chicago open data portal)burglaries <-st_read("data/burglaries.shp") %>%st_transform('ESRI:102271')# Check the datacat(" - Number of burglaries:", nrow(burglaries), "\n")cat(" - CRS:", st_crs(burglaries)$input, "\n")
1.4 Visualize Point Data
1.4.1 Rodent Baiting Data
Code
# Simple point map for Rodent Baitingp1 <-ggplot() +geom_sf(data = chicagoBoundary, fill ="gray95", color ="gray60") +geom_sf(data = rb_sf_2017, color ="#d62828", size =0.1, alpha =0.4) +labs(title ="Rodent Baiting 311 Service Requests",subtitle =paste0("Chicago, n = ", nrow(rb_sf_2017)) )# Density surface using 311 Rodent Baiting datap2 <-ggplot() +geom_sf(data = chicagoBoundary, fill ="gray95", color ="gray60") +geom_density_2d_filled(data =data.frame(st_coordinates(rb_sf_2017)),aes(X, Y),alpha =0.7,bins =8 ) +scale_fill_viridis_d(option ="plasma",direction =-1,guide ="none" ) +labs(title ="Density Surface",subtitle ="Kernel density estimation of Rodent Baiting Requests" )# Combine plots using patchworkp1 + p2 +plot_annotation(title ="Spatial Distribution of Rodent Baiting 311 Requests in Chicago",tag_levels ="A" )
The maps above show the spatial distribution of Rodent Baiting 311 requests across Chicago.
The point map on the left illustrates the raw locations of all reported rodent issues. The large number of points makes the overall pattern dense, but we can still see clear clusters in the northern and western parts of the city.
The density map on the right provides a smoother view of these patterns. Several strong hotspots emerge, especially in the Northwest Side and Near North areas. In contrast, the southern and far southwestern parts of Chicago show much lower intensity.
1.4.2 Burgalaries Data
Code
# Simple point map for Burglaries (P3)p3 <-ggplot() +geom_sf(data = chicagoBoundary, fill ="gray95", color ="gray60") +geom_sf(data = burglaries, color ="#003f5c", size =0.15, alpha =0.5) +labs(title ="Residential Burglaries",subtitle =paste0("Chicago, n = ", nrow(burglaries)) )# Density surface for Burglaries (P4)p4 <-ggplot() +geom_sf(data = chicagoBoundary, fill ="gray95", color ="gray60") +geom_density_2d_filled(data =data.frame(st_coordinates(burglaries)),aes(X, Y),alpha =0.7,bins =8 ) +scale_fill_viridis_d(option ="plasma",direction =-1,guide ="none" ) +labs(title ="Density Surface",subtitle ="Kernel density estimation of Residential Burglaries" )# Combine P3 & P4p3 + p4 +plot_annotation(title ="Spatial Distribution of Residential Burglaries in Chicago",tag_levels ="A" )
The maps show where residential burglaries were concentrated in Chicago.
The darker purple areas represent the highest concentrations of burglaries. These hot spots appear mainly in the north and south-central parts of the city.
Overall, the maps suggest that residential burglaries cluster in specific neighborhoods rather than occurring randomly across Chicago.
1.5 Chapter Summary
Load maps and boundaries of Chicago:
This chapter loads police districts, police beats, and the city boundary.
These layers help define the study area and later allow spatial analysis.
All maps are changed into the same coordinate system so they line up correctly.
Load 311 Rodent Baiting data:
This chapter loads Chicago 311 Rodent Baiting data, and keeps only records from 2017.
Load burglary data:
A shapefile of burglary cases is loaded.
It is also transformed into the same coordinate system.
This dataset will be compared with the rodent data later.
Purpose of Step 1:
Gather most of the data needed for the project.
Clean and filter the data so it is accurate and consistent.
Make basic maps to understand where events happen.
Step 2: Fishnet Grid Creation
2.1 Create Fishnet Grid
Code
# Create 500m x 500m gridfishnet <-st_make_grid( chicagoBoundary,cellsize =500, # 500 meters per cellsquare =TRUE) %>%st_sf() %>%mutate(uniqueID =row_number())# Keep only cells that intersect Chicagofishnet <- fishnet[chicagoBoundary, ]# View basic infocat(" - Number of cells:", nrow(fishnet), "\n")
Quantile Distribution of Rodent Baiting 311 Request Counts
Percentile
Value
0%
0
10%
0
20%
0
30%
3
40%
7
50%
11
60%
17
70%
24
80%
36
90%
56
100%
180
Most grid cells have very low request counts. According to the quantile table, 30% of the cells have 3 or fewer requests, and 50% have 11 or fewer.
This pattern shows a strong right-skewed distribution, where a small number of hotspots generate many more requests than the rest of the city.
On the map, hotspots appear mainly in the central and north neighborhoods, forming clear clusters of higher counts, and suggesting strong spatial heterogeneity.
2.4 Visualize Burgalaries with fishnet
Code
# Burglary 500m x 500m grid mapggplot() +geom_sf(data = fishnet, aes(fill = countBurglaries), color =NA) +geom_sf(data = chicagoBoundary, fill =NA, color ="white", linewidth =1) +scale_fill_viridis_c(name ="Residential Burglaries",option ="plasma",trans ="sqrt",breaks =c(0, 1, 2, 3, 5, 8, 40),limits =c(0, 40),oob = scales::squish ) +labs(title ="Residential Burglaries by Grid Cell",subtitle ="500m x 500m cells, Chicago, 2017" ) +theme_default()
Quantile Distribution of Residential Burglary Counts
Percentile
Value
0%
0
10%
0
20%
0
30%
0
40%
1
50%
2
60%
3
70%
4
80%
5
90%
8
100%
40
Most grid cells have very few burglaries. From the quantiles, 50% of all cells have 2 or fewer burglary cases.
The distribution is right-skewed. Many places have low values (0–3), while only a few places have high values.
On the map, brighter colors (yellow) show where burglaries are concentrated. These hotspots appear mainly in the northern and southern parts of Chicago.
The pattern suggests burglary risk is not uniform across the city. Instead, it clusters in specific neighborhoods, which is important for modeling and prediction.
2.5 Chapter Summary
Creating the Fishnet Grid:
This chapter created a 500m × 500m grid that covers the whole Chicago boundary.
Each grid cell gets a unique ID, which helps match spatial features later.
The grid makes it easier to compare different locations using the same unit of space.
Aggregating Rodent and Burglary Data:
Each 311 rodent request and burglary point is assigned to a grid cell using a spatial join.
Then, this chapter counted how many events happen in each cell.
Cells with no events are filled with zero, so the dataset stays complete.
This step converts point data into count data, which is needed for statistical modeling.
Mapping Rodent Baiting Counts:
The map shows clear hotspots where rodent activity is high.
The distribution is very skewed, meaning most cells have low values, but a few cells have extremely high values.
This pattern suggests strong spatial clustering.
Mapping Burglary Counts:
Burglary counts also show uneven spatial distribution.
Many cells have 0–3 burglaries, while only a small number of cells have high counts.
This map confirms that burglaries are not random but cluster in specific neighborhoods.
Overall Importance of Step 2:
This chapter transforms raw point data into a structured spatial dataset.
It creates the base layer (the grid) where all future variables will be attached.
It also reveals important spatial patterns that guide later modeling steps.
Step 3: Spatial Features
3.1 k-nearest neighbor
Code
# 1. Get centroid coordinatesfishnet_coords <-st_coordinates(st_centroid(fishnet))# 2. Compute nearest 8 neighbors (to other grid cells)knn_result <-get.knn(fishnet_coords, k =8)# 3. Add KNN features to fishnetfishnet <- fishnet %>%mutate(knn_mean =rowMeans(matrix(countRodent[knn_result$nn.index], nrow =nrow(fishnet))),knn_sum =rowSums(matrix(countRodent[knn_result$nn.index], nrow =nrow(fishnet))) )summary(fishnet$knn_mean)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 6.281 15.562 20.992 31.500 112.125
Code
summary(fishnet$knn_sum)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 50.25 124.50 167.94 252.00 897.00
I calculated the summary and mean values of neighboring grid cells instead of the distance nearest event points. This method captures spatial spillover between neighborhoods better, providing stronger and more reliable predictive power.
3.2 Distance to Hot Spots
3.2.1 Perform Local Moran’s I analysis
Code
# Function to calculate Local Moran's Icalculate_local_morans <-function(data, variable, k =5) {# Create spatial weights based on k-nearest neighbors coords <-st_coordinates(st_centroid(data)) neighbors <-knn2nb(knearneigh(coords, k = k)) weights <-nb2listw(neighbors, style ="W", zero.policy =TRUE)# Calculate Local Moran's I local_moran <-localmoran(data[[variable]], weights)# Classify clusters mean_val <-mean(data[[variable]], na.rm =TRUE) data %>%mutate(local_i = local_moran[, 1],p_value = local_moran[, 5],is_significant = p_value <0.05,moran_class =case_when(!is_significant ~"Not Significant", local_i >0& .data[[variable]] > mean_val ~"High-High", local_i >0& .data[[variable]] <= mean_val ~"Low-Low", local_i <0& .data[[variable]] > mean_val ~"High-Low", local_i <0& .data[[variable]] <= mean_val ~"Low-High",TRUE~"Not Significant" ) )}# Apply to rodent baiting countsfishnet <-calculate_local_morans(fishnet, "countRodent", k =5)
3.2.2 Visualize Morans
Code
#| fig-width: 8#| fig-height: 6# Visualize Local Moran's I clusters for rodent baitingggplot() +geom_sf(data = fishnet, aes(fill = moran_class), color =NA ) +scale_fill_manual(values =c("High-High"="#d7191c", # hotspots"High-Low"="#fdae61","Low-High"="#abd9e9","Low-Low"="#2c7bb6", # cold spots"Not Significant"="gray90" ),name ="Cluster Type" ) +labs(title ="Local Moran's I: Rodent Baiting Clusters",subtitle ="High-High = Hot spots of rodent activity" ) +theme_default()
Why there is no “Low-Low” and “High-Low” clusters?
The data shows a highly right-skewed distribution.
When high-value areas are clustered together, they are very likely to be identified as High-High (hotspot).
Low-value areas are everywhere, so “low + low + low” is too “normal” for the tests. As a result, it will not be marked as significantly “Low-Low”, but rather “Not Significant”.
In the reality, rodent activity is concentrated in a small number of hotspots, while low counts are widespread across the city.
3.2.3 Distance to Hotspots
Code
# Get centroids of "High-High" cells (rodent hotspots)hotspots <- fishnet %>%filter(moran_class =="High-High") %>%st_centroid()# Calculate distance from each cell to nearest rodent hotspotif (nrow(hotspots) >0) { fishnet <- fishnet %>%mutate(dist_to_hotspot =as.numeric(st_distance(st_centroid(fishnet), hotspots %>%st_union()) ) )cat(" - Number of hot spot cells:", nrow(hotspots), "\n")} else { fishnet <- fishnet %>%mutate(dist_to_hotspot =0)cat("⚠ No significant rodent baiting hot spots found\n")}
- Number of hot spot cells: 304
3.3 Building Age
3.3.1 Building Age Feature
Code
# Read building databuilding_sf <-st_read("data/Building_Footprints_20251117.geojson") %>%# buildings_sf <- st_read(# "https://data.cityofchicago.org/resource/syp8-uezg.geojson",# quiet = TRUE# ) %>%# Clean and calculate the building years of 2017filter(!is.na(year_built)) %>%mutate(year_built =as.numeric(year_built)) %>%filter(year_built >0, year_built <=2017) %>%mutate(bldg_age_2017 =2017- year_built) %>%st_transform('ESRI:102271')
Reading layer `Building_Footprints_20251117' from data source
`/Users/demiyang/Downloads/MUSA/PPA/portfolio/portfolio-setup-demiyang12/assignments/assignment_4/data/Building_Footprints_20251117.geojson'
using driver `GeoJSON'
replacing null geometries with empty geometries
Simple feature collection with 820606 features and 46 fields (with 6 geometries empty)
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -87.93981 ymin: 41.6446 xmax: -87.52421 ymax: 42.023
Geodetic CRS: WGS 84
ggplot() +geom_sf(data = fishnet, aes(fill = mean_bldg_age), color =NA) +geom_sf(data = chicagoBoundary, fill =NA, color ="white", linewidth =1) +scale_fill_viridis_c(name ="Mean building age (years)",option ="magma",direction =-1 ,na.value ="grey90" ) +labs(title ="Average Building Age by Grid Cell",subtitle ="500m x 500m cells, Chicago, 2017" ) +theme_default()
The map shows large differences in building age across Chicago.
Newer buildings appear more often in the central and far north areas, where the colors are lighter.
The pattern suggests that building age is not evenly distributed.Older structures cluster in certain neighborhoods, while others have more recent development.
Many burglary hotspots appear in areas where buildings are older.
These locations, mainly in the north, and southeast parts of Chicago—show both high burglary counts and high building age.
Older buildings may have weaker physical security, such as outdated doors, windows, or locks.
Older neighborhoods may also have denser housing, which increases opportunities for burglary.
Reading layer `Boundaries_-_Zoning_Districts_(current)_20251117' from data source `/Users/demiyang/Downloads/MUSA/PPA/portfolio/portfolio-setup-demiyang12/assignments/assignment_4/data/Boundaries_-_Zoning_Districts_(current)_20251117.geojson'
using driver `GeoJSON'
Simple feature collection with 14814 features and 28 fields (with 4 geometries empty)
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -87.94009 ymin: 41.64455 xmax: -87.52414 ymax: 42.02303
Geodetic CRS: WGS 84
Code
# Extract the letter parts of the first three characters from ZONE_CLASSzoning_sf <- zoning_sf %>%mutate(zone_prefix =substr(zone_class, 1, 3),zone_prefix =gsub("[^A-Za-z]", "", zone_prefix) ) %>%filter(zone_prefix !="") %>%mutate(zone_code = dplyr::recode( zone_prefix,"RS"="1","RT"="2","RM"="3","B"="4","C"="5","DC"="6","DR"="7","DS"="8","DX"="9","M"="10","PMD"="11","PD"="12","T"="13","POS"="14",.default =NA_character_ ) ) %>%filter(!is.na(zone_code)) %>%st_transform("ESRI:102271")zoning_sf$zone_code <-as.integer(zoning_sf$zone_code)# Calculate the modemode_first <-function(x) { x <- x[!is.na(x)]if (length(x) ==0) return(NA) ux <-unique(x) ux[which.max(tabulate(match(x, ux)))]}# Aggregate to fishnet: One zoning type for each gridzoning_by_grid <-st_join(fishnet, zoning_sf, join = st_intersects) %>%st_drop_geometry() %>%group_by(uniqueID) %>%summarize(zone_prefix_dom =mode_first(zone_prefix), zone_code_dom =mode_first(zone_code),.groups ="drop" )# join the zoning variable back to fishnetfishnet <- fishnet %>%left_join(zoning_by_grid, by ="uniqueID")
The map shows the dominant zoning type in each 500m × 500m grid cell in Chicago.
The map shows Chicago has a mixed pattern, but residential zones dominate most of the city.
Downtown districts (DR, DS, DX) are concentrated in the central east area.
Manufacturing (M, PMD) zones cluster in the southeast, and middlewest.
Comparing with burglary distribution, burglary hotspots occur mostly in residential zones.
This makes sense because burglary is a residential crime, so it appears mainly where people live.
Manufacturing zones have low burglary counts.
These areas have fewer homes, more industrial facilities, and lower residential density.
3.5 Chapter Summary
k-Nearest Neighbor (KNN) Feature for 311 request feature:
For every cell, this chapter found its 8 nearest neighboring cells.
Then this chapter calculated knn_mean, which is the average rodent request count in those neighbors, and knn_sum, which is the total rodent request count in those neighbors.
Distance to Hot Spots Feature(Local Moran’s I):
Each grid cell was classified into: High–High (hotspot), Low–Low, High–Low, Low–High, or Not Significant. In practice, most significant clusters are High–High hotspots, while many low-count areas are “Not Significant” instead of Low–Low.
Then this chapter took the centroids of High–High cells and measured, for every grid cell, the distance to the nearest hotspot (dist_to_hotspot).
This creates a feature that shows how close each place is to intense rodent activity.
Building Age Feature:
Building data was loaded from Chicago Data Portal, named Building Footprints.
This chapter cleaned the data and kept only buildings built before or in 2017, then computed building age in 2017.
Then, for each grid cell, this chapter calculated: n_buildings,which is this number of buildings with valid age. mean_bldg_age, which is the average building age. median_bldg_age, which is the median building age.
These values were joined back to the fishnet, so every cell has information about its building stock and age structure.
Land Use Feature:
Zoning data was loaded from Chicago Data Portal, named Boundaries - Zoning Districts (current).
This chapter aggregated the zoning data into the secondary classification of zoning classes.(reference: https://secondcityzoning.org/zones/)
For each cell, this chapter identified its function as its dominant zoning type.
A map of these categories shows how residential, commercial, downtown, industrial, and park areas are distributed across Chicago.
Purpose of Step 3:
Step 3 builds key spatial predictor variables on top of the grid: neighborhood intensity (knn_mean), proximity to rodent hotspots (dist_to_hotspot), built environment age (mean_bldg_age), and land use type (zone_code_dom).
They are later used in the Poisson and Negative Binomial regression models as main explanatory variables.
#| label: poisson-dispersion-check#| message: false#| warning: falsedispersion_rb <-sum(residuals(pois_model, type ="pearson")^2) / pois_model$df.residualcat("Dispersion parameter for rodent baiting Poisson model:",round(dispersion_rb, 2), "\n")
Dispersion parameter for rodent baiting Poisson model: 3.47
Code
cat("Rule of thumb: values > 1.5 indicate overdispersion.\n")
Rule of thumb: values > 1.5 indicate overdispersion.
Code
if (dispersion_rb >1.5) {cat("⚠ Overdispersion detected for rodent baiting counts.\n")cat(" The Poisson model is likely too restrictive.\n")cat(" A Negative Binomial model is more appropriate.\n")} else {cat("✓ Dispersion looks acceptable.\n")cat(" The Poisson model may be adequate for these counts.\n")}
⚠ Overdispersion detected for rodent baiting counts.
The Poisson model is likely too restrictive.
A Negative Binomial model is more appropriate.
In this chapter, I built statistical models to explain how burglary counts change with spatial features:
building age, zoning, distance to rodent baiting counts hotspots, and mean values of rodent baiting counts in neighboring grid cells.
Then, this chapter first fitted a Poisson regression with these variables and then checked the dispersion of the residuals. The Poisson dispersion is about 3.47, which is much larger than 1.5. This means the data are strongly overdispersed (variance much higher than the mean), so the Poisson model is too restrictive.
To handle overdispersion, research then fitted a Negative Binomial (NB) regression with the same main predictors.This model allows extra variance and is better suited for these burglary counts.
Poisson model has an AIC of about 11849, while the Negative Binomial model has an AIC of about 9837. A lower AIC means a better fit, so the Negative Binomial model performs much better.
Overall, Step 4 shows that burglary counts are highly overdispersed, and that the Negative Binomial regression is the appropriate model for capturing the relationship between burglaries and the spatial features created in previous steps.
Step 5: Spatial Cross-Validation
5.1 Leave-One-Group-Out cross-validation on 2017 data
Code
# Calculate which police district the centroid of the grid falls infishnet_centroids <-st_centroid(fishnet) %>%st_join(policeDistricts, join = st_within) %>%st_drop_geometry() %>% dplyr::select(uniqueID, District)# Construct a dataset for cross-validationfishnet_model <- fishnet %>%left_join(fishnet_centroids, by ="uniqueID") %>%filter(!is.na(District)) %>%mutate(District =as.factor(District),zone_code_dom =as.factor(zone_code_dom) ) %>%filter(!is.na(countRodent),!is.na(mean_bldg_age),!is.na(zone_code_dom),!is.na(knn_mean),!is.na(dist_to_hotspot) )# LOGO CVdistricts <-unique(fishnet_model$District)cv_results <-tibble()for (i inseq_along(districts)) { test_district <- districts[i]# Divide the training set/test set train_data <- fishnet_model %>%filter(District != test_district) test_data <- fishnet_model %>%filter(District == test_district) train_data <- train_data %>%mutate(zone_code_dom =factor(zone_code_dom)) test_data <- test_data %>%mutate(zone_code_dom =factor(zone_code_dom,levels =levels(train_data$zone_code_dom)))# Fit the NB modelmodel_cv <- MASS::glm.nb( countBurglaries ~ mean_bldg_age + zone_code_dom + dist_to_hotspot + knn_mean,data = train_data )# Make predictions on the test police district test_data <- test_data %>%mutate(prediction =predict(model_cv, newdata = ., type ="response"),abs_error =abs(countRodent - prediction),sq_error = (countRodent - prediction)^2 ) test_eval <- test_data %>%filter(!is.na(prediction)) mae <-mean(test_eval$abs_error, na.rm =TRUE) rmse <-sqrt(mean(test_eval$sq_error, na.rm =TRUE))# Save the resultcv_results <-bind_rows( cv_results,tibble(fold = i,test_district =as.character(test_district),n_test =nrow(test_eval),mae = mae,rmse = rmse ) )cat(" Fold", i, "/", length(districts),"- District", as.character(test_district),"- n_test:", nrow(test_eval),"- MAE:", round(mae, 2),"- RMSE:", round(rmse, 2), "\n")}
5.2 Calculate and report error metrics (MAE, RMSE)
Code
cv_sorted <- cv_results %>%arrange(fold)# Table 1: local MAE and RMSEcv_sorted %>% knitr::kable(digits =2,caption ="Leave-One-District-Out CV Results (per district)" ) %>% kableExtra::kable_styling(bootstrap_options =c("striped", "hover"))
Leave-One-District-Out CV Results (per district)
fold
test_district
n_test
mae
rmse
1
5
98
6.54
8.97
2
4
181
5.09
7.81
3
22
128
7.86
10.58
4
31
18
1.79
2.17
5
6
81
15.27
21.39
6
8
227
23.52
35.95
7
7
67
25.62
31.88
8
3
59
8.14
10.87
9
2
78
7.88
12.69
10
9
140
18.20
28.11
11
10
84
22.02
32.60
12
1
44
6.66
10.72
13
12
98
21.82
35.64
14
15
37
19.80
25.29
15
11
65
23.88
31.78
16
18
43
29.25
40.30
17
25
113
28.05
36.95
18
14
65
45.98
57.32
19
19
89
43.28
56.53
20
16
179
16.79
23.99
21
17
97
37.12
47.60
22
20
45
41.37
51.85
23
24
54
55.43
63.31
Code
# Table 2: Average MAE and RMSecv_summary <- cv_sorted %>%summarise(mean_MAE =mean(mae, na.rm =TRUE),mean_RMSE =mean(rmse, na.rm =TRUE),n_folds = dplyr::n() )cv_summary %>% knitr::kable(digits =2,caption ="Average MAE and RMSE across all districts (LOGO CV)" ) %>% kableExtra::kable_styling(bootstrap_options =c("striped", "hover"))
Average MAE and RMSE across all districts (LOGO CV)
mean_MAE
mean_RMSE
n_folds
22.23
29.75
23
5.3 Visualize error metrics
Code
cv_by_district <- cv_results %>%group_by(test_district) %>%summarise(mae =mean(mae, na.rm =TRUE),rmse =mean(rmse, na.rm =TRUE),.groups ="drop" )# Merge with the police district base mappolice_cv <- policeDistricts %>%mutate(District =as.character(District)) %>%left_join(cv_by_district, by =c("District"="test_district"))# RMSE spatial distributionggplot(police_cv) +geom_sf(aes(fill = rmse)) +labs(title ="LOGO CV RMSE by Police District",fill ="RMSE" ) +theme_minimal()
Code
# MAE spatial distributionggplot(police_cv) +geom_sf(aes(fill = mae)) +labs(title ="LOGO CV MAE by Police District",fill ="MAE" ) +theme_minimal()
Prediction errors vary strongly across the city.
Both RMSE and MAE maps show that some police districts have much higher prediction errors than others.
The lightest areas (middle and far-north districts) have RMSE > 60 and MAE > 50, meaning the model struggles most in these regions.
High-error districts overlap with low burglary areas
The earlier burglary map showed that the northern parts of Chicago have very low burglary counts. These same areas now show the highest prediction errors.
5.4 Chapter Summary
Leave-One-District-Out Cross-Validation
The resaerch used police districts as spatial groups for cross-validation.
For each fold: Took one district as the test set, then used all the other districts as the training set. Fitted a Negative Binomial model with the same predictors as before: mean_bldg_age, zone_code_dom, dist_to_hotspot, and knn_mean. After that, Predicted burglary counts for the test district and calculated absolute error and squared error for each grid cell.
This process was repeated for every police district, so each district was left out once.
Error Metrics (MAE and RMSE)
MAE (Mean Absolute Error): average size of errors.
RMSE (Root Mean Squared Error): penalizes large errors more strongly.
Mapping the Errors
These maps show that prediction accuracy is not uniform across the city: Some districts (especially in the north) have higher MAE/RMSE, meaning the model struggles there. Districts in the central and southern hotspot areas tend to have lower errors, where burglary patterns are stronger and easier to learn.
Purpose of Step 5
This step tests the model in a realistic spatial way: it predicts for entire districts that were never seen in training.
It reveals where the model is reliable and where it is weak, instead of only reporting a single global fit statistic.
The results show that the Negative Binomial model captures burglary patterns better in high-crime, hotspot districts, but has larger uncertainty in low-crime districts.
Step 6: Model Evaluation
6.1 KDE baseline
Code
# 1. Extract XY coordinates of burglary pointsbur_xy <-st_coordinates(burglaries)# 2. Create a bounding window from the Chicago boundarywin_bbox <- spatstat.geom::owin(xrange =st_bbox(chicagoBoundary)[c("xmin", "xmax")],yrange =st_bbox(chicagoBoundary)[c("ymin", "ymax")])# 3. Create a ppp object (point pattern) for burglariesbur_ppp <- spatstat.geom::ppp(x = bur_xy[, 1],y = bur_xy[, 2],window = win_bbox)# 4. Kernel density estimate for the point pattern# IMPORTANT: call density.ppp(bur_ppp, ...) — no "X ="bur_kde <- spatstat.explore::density.ppp( bur_ppp,sigma =1000# bandwidth in meters)# 5. Convert KDE to raster and extract values at grid centroidsbur_kde_raster <- raster::raster(bur_kde)fishnet_centroids <-st_coordinates(st_centroid(fishnet))fishnet$kde_value <- raster::extract( bur_kde_raster, fishnet_centroids)# 6. Replace NA values (outside the KDE window) with 0fishnet$kde_value[is.na(fishnet$kde_value)] <-0
6.2 Final model + predictions
Code
# Fit the final NB model using all available training cellsfinal_model <-glm.nb( countBurglaries ~ mean_bldg_age + knn_mean + dist_to_hotspot + zone_prefix_dom,data = fishnet_model)# Add NB predictions back to fishnet (matched by uniqueID)fishnet <- fishnet %>%mutate(prediction_nb =predict( final_model,newdata = fishnet_model,type ="response" )[match(uniqueID, fishnet_model$uniqueID)] )# Normalize KDE predictions to total burglary countkde_sum <-sum(fishnet$kde_value, na.rm =TRUE)count_sum <-sum(fishnet$countBurglaries, na.rm =TRUE)fishnet <- fishnet %>%mutate(prediction_kde = (kde_value / kde_sum) * count_sum )fishnet <- fishnet %>%mutate(error_nb = countBurglaries - prediction_nb,abs_error_nb =abs(error_nb),error_kde = countBurglaries - prediction_kde,abs_error_kde =abs(error_kde) )
6.3 Compare Actual vs NB vs KDE
Code
#| fig-width: 12#| fig-height: 4# Actual countsp1 <-ggplot() +geom_sf(data = fishnet, aes(fill = countBurglaries), color =NA) +scale_fill_viridis_c(name ="Count", option ="plasma", limits =c(0, 15) ) +labs(title ="Actual Burglaries (2017)") +theme_default()# Negative Binomial predictionsp2 <-ggplot() +geom_sf(data = fishnet, aes(fill = prediction_nb), color =NA) +scale_fill_viridis_c(name ="Predicted", option ="plasma", limits =c(0, 15) ) +labs(title ="NB Model Predictions") +theme_default()# KDE baseline predictionsp3 <-ggplot() +geom_sf(data = fishnet, aes(fill = prediction_kde), color =NA) +scale_fill_viridis_c(name ="Predicted", option ="plasma", limits =c(0, 15) ) +labs(title ="KDE Baseline Predictions") +theme_default()# Combine the three mapsp1 + p2 + p3 +plot_annotation(title ="Actual vs Predicted Burglaries",subtitle ="Comparing NB Model and KDE Baseline" )
All three maps show the same broad structure: higher burglary levels in the middle north and middle south parts of Chicago, and much lower activity in the middle, far north and far south part of the city.
The actual burglary map shows sharp, uneven hotspots. The NB model captures the broad hotspot regions and major regional contrasts, but produces smoother predictions because it relies on explanatory variables such as building age, zoning, and hotspot distance.
The KDE baseline looks visually similar to the actual map, but this is expected, as KDE simply smooths the observed data without any modeling of underlying urban processes.
6.4 Model performance comparison (MAE / RMSE)
Code
# Calculate MAE and RMSE for NB model and KDE baselinecomparison <- fishnet %>%st_drop_geometry() %>%filter(!is.na(prediction_nb), !is.na(prediction_kde)) %>%summarize(model_mae =mean(abs(countBurglaries - prediction_nb)),model_rmse =sqrt(mean((countBurglaries - prediction_nb)^2)),kde_mae =mean(abs(countBurglaries - prediction_kde)),kde_rmse =sqrt(mean((countBurglaries - prediction_kde)^2)) )# Reshape and print the tablecomparison %>%pivot_longer(everything(), names_to ="metric", values_to ="value" ) %>%separate(metric, into =c("approach", "metric"), sep ="_") %>%pivot_wider(names_from = metric, values_from = value) %>%kable(digits =2,caption ="Model Performance: NB vs KDE" ) %>%kable_styling(bootstrap_options =c("striped", "hover"))
Model Performance: NB vs KDE
approach
mae
rmse
model
2.5
3.52
kde
2.1
2.98
KDE achieves lower MAE and RMSE because it directly smooths the observed burglary events, essentially reproducing the spatial pattern of the actual data. This makes KDE a strong spatial baseline but not a predictive or explanatory model.
In contrast, the Negative Binomial model aims to explain burglary variation using built-environment and zoning characteristics, rather than simply interpolating past events. Therefore, its higher error values are expected and do not indicate model failure. Instead, the NB model provides interpretability and generalization ability that KDE cannot offer.
Under-prediction happens more in major hotspot regions (especially central–south), while over-prediction appears more in low-crime northern areas.
This reflects the NB model’s reliance on built-environment variables, which smooth the predictions and do not fully capture local spikes.
NB Model: Absolute Error(Top-Right)
Absolute errors are larger mainly in districts with very high or very low burglary counts.
High-crime areas in the south show moderate errors but not extreme ones.
The NB model handles high-crime zones better because built-environment variables explain these areas more consistently.
KDE Baseline: Signed Error(Bottom-Left)
KDE errors are much more random and noisy.
The spatial error pattern lacks structure; KDE simply smooths nearby crime counts.
This randomness reflects that KDE is not a model. It only uses spatial proximity and does not incorporate zoning or urban characteristics.
KDE Baseline: Absolute Error(Bottom-Right)
Absolute errors are generally smaller than NB (expected), because KDE is effectively a smoothed version of the real data.
NB shows more structured errors because it predicts using explanatory variables. This is not a failure, it’s evidence that the NB model is capturing urban mechanisms, not just smoothing past data.
6.6 Chapter Summary
KDE Baseline:
This chapter built a KDE (Kernel Density Estimation) surface using the burglary point pattern.
KDE acts as a spatial baseline: it uses only the locations of past events, without any built-environment variables.
Final NB Model and Predictions:
In this chapter, I fitted a final Negative Binomial model using all training cells from fishnet_model, including mean_bldg_age, knn_mean, dist_to_hotspot, zone_prefix_dom.
Research generated NB predictions for each grid cell and stored them as prediction_nb. Meanwhile, this chapter used the KDE values to create a KDE prediction prediction_kde for each cell.
For both approaches, compute signed error and absolute error relative to the actual burglary counts.
Overall Interpretation of Step 6:
Step 6 evaluates whether the Negative Binomial model provides reasonable and useful predictions compared to a strong spatial baseline (KDE).
Results show that:
KDE achieves slightly lower MAE/RMSE because it effectively reproduces the observed pattern by smoothing the data.
The NB model offers interpretable, mechanism-based predictions, linking burglary risk to building age, zoning, rodent baiting requests hotspot distance, and its neighborhood feature, even though its errors are somewhat higher.
Therefore, KDE is useful as a benchmark for spatial clustering, while the NB model is more valuable for understanding and explaining where and why burglaries concentrate, and for generalizing to new times or scenarios.
Step7: Summary Statistics and Tables
7.1 Model Summary Table
Code
# Lookup full zoning labelszoning_labels <-c("RS"="Residential Single-Unit District","RT"="Residential Two-Flat, Townhouse and Multi-Unit District","RM"="Residential Multi-Unit District","B"="Business","C"="Commercial","DC"="Downtown Core District","DR"="Downtown Residential District","DS"="Downtown Service District","DX"="Downtown Mixed-Use District","M"="Manufacturing","PMD"="Planned Manufacturing Districts","PD"="Planned Development","T"="Transportation","POS"="Parks and Open Space")model_summary <- broom::tidy(final_model, exponentiate =TRUE) %>%mutate(term = dplyr::case_when( term =="(Intercept)"~"Intercept", term =="mean_bldg_age"~"Mean building age", term =="knn_mean"~"KNN mean (neighboring rodent baiting requests)", term =="dist_to_hotspot"~"Distance to rodent hotspot",# Zoning coefficients: zone_prefix_domRS, zone_prefix_domRTgrepl("^zone_prefix_dom", term) ~paste0("Zoning: ",sub("zone_prefix_dom", "", term), " (", zoning_labels[sub("zone_prefix_dom", "", term)],")" ),TRUE~ term ),# Round all numeric columns dplyr::across(where(is.numeric), ~round(.x, 3)) )model_summary %>%kable(caption ="Final Negative Binomial Model Coefficients (Exponentiated Rate Ratios)",col.names =c("Variable", "Rate Ratio", "Std. Error", "Z", "P-Value") ) %>%kable_styling(bootstrap_options =c("striped", "hover")) %>%footnote(general =paste("Rate ratios > 1 indicate a positive association with burglary counts,","while rate ratios < 1 indicate a negative association.","Zoning (land use) is a categorical variable; each zoning category is interpreted","relative to the reference category (the omitted baseline level of zone_prefix_dom)." ) )
Final Negative Binomial Model Coefficients (Exponentiated Rate Ratios)
Variable
Rate Ratio
Std. Error
Z
P-Value
Intercept
0.714
0.126
-2.667
0.008
Mean building age
1.020
0.001
16.204
0.000
KNN mean (neighboring rodent baiting requests)
1.003
0.002
1.613
0.107
Distance to rodent hotspot
1.000
0.000
-1.137
0.256
Zoning: C (Commercial)
0.896
0.089
-1.239
0.216
Zoning: DR (Downtown Residential District)
0.000
114839.703
0.000
1.000
Zoning: DS (Downtown Service District)
0.000
80757.872
0.000
1.000
Zoning: DX (Downtown Mixed-Use District)
1.391
0.359
0.921
0.357
Zoning: M (Manufacturing)
0.463
0.098
-7.891
0.000
Zoning: PD (Planned Development)
0.800
0.092
-2.432
0.015
Zoning: PMD (Planned Manufacturing Districts)
0.440
0.194
-4.233
0.000
Zoning: POS (Parks and Open Space)
0.771
0.156
-1.661
0.097
Zoning: RM (Residential Multi-Unit District)
1.107
0.112
0.908
0.364
Zoning: RS (Residential Single-Unit District)
0.865
0.058
-2.478
0.013
Zoning: RT (Residential Two-Flat, Townhouse and Multi-Unit District)
0.969
0.081
-0.393
0.694
Zoning: T (Transportation)
0.587
0.755
-0.706
0.480
Note:
Rate ratios > 1 indicate a positive association with burglary counts, while rate ratios < 1 indicate a negative association. Zoning (land use) is a categorical variable; each zoning category is interpreted relative to the reference category (the omitted baseline level of zone_prefix_dom).
Mean building age has a strong and significant positive effect. For each extra year of average building age in a grid cell, the expected burglary count increases by about 2%, holding other variables constant.
Manufacturing (M) and Planned Manufacturing Districts (PMD) both have p < 0.001, and their Rate Ratio both lower than 1, meaning that burglary risk in industrial zones is much lower than in the reference zoning category.
Planned development areas(PD)(RR ≈ 0.80, p ≈ 0.015) also show lower burglary risk.
Residential Single-Unit (RS)(RR ≈ 0.87, p ≈ 0.013) have somewhat lower burglary counts than the reference zoning.
Some downtown categories (DR, DS) have extreme rate ratios and huge standard errors, likely because there are very few grid cells of these types, so their coefficients should be interpreted with caution.
7.2 Discussion and Limitations
This study shows that the NB model can capture some broad burglary patterns, such as higher risk in areas with older buildings and lower risk in manufacturing zones. However, the model also has clear limitations. Some of the variables I included—such as rodent KNN and distance to rodent hotspots—were chosen somewhat loosely and do not strongly relate to burglary in the results. Because of this, the model cannot explain the sharp hotspots seen in the actual data. Overall, while the NB model offers some useful insights, its performance is restricted by both the choice of variables and the model structure, and future work should include more relevant predictors and more advanced spatial methods.
Source Code
---title: "Predictive Policing - Technical Implementation"subtitle: "Exploring Burglaries Patterns in Chicago and Predictive Modeling with 311 Data"author: "Yuqing(Demi) Yang"date: todayformat: html: code-fold: show code-tools: true toc: true toc-depth: 3 toc-location: left theme: cosmo embed-resources: trueeditor: visualexecute: warning: false message: false---## Assignment OverviewIn this lab, you will apply the spatial predictive modeling techniques demonstrated in the class exercise to a 311 service request type of your choice. You will build a complete spatial predictive model, document your process, and interpret your results.### Timeline & Deliverables**Due Date:** November 17, 2025, 10:00AM**Deliverable:** One rendered document, posted to your portfolio website.### Learning GoalsBy completing this assignment, you will demonstrate your ability to:- Adapt example code to analyze a new dataset- Build spatial features for predictive modeling- Apply count regression techniques to spatial data- Implement spatial cross-validation- Interpret and communicate model results- Critically evaluate model performance------------------------------------------------------------------------# Step 0: Set Up```{r}#| message: false#| warning: false# Load required packageslibrary(tidyverse) # Data manipulationlibrary(sf) # Spatial operationslibrary(here) # Relative file pathslibrary(viridis) # Color scaleslibrary(terra) # Raster operations (replaces 'raster')library(spdep) # Spatial dependencelibrary(FNN) # Fast nearest neighborslibrary(MASS) # Negative binomial regressionlibrary(patchwork) # Plot composition (replaces grid/gridExtra)library(knitr) # Tableslibrary(kableExtra) # Table formattinglibrary(classInt) # Classification intervalslibrary(here)# Spatstat split into sub-packageslibrary(spatstat.geom) # Spatial geometrieslibrary(spatstat.explore) # Spatial exploration/KDE# Set optionsoptions(scipen =999) # No scientific notationset.seed(5080) # Reproducibility# Create consistent theme for visualizationstheme_default <-function(base_size =11) {theme_minimal(base_size = base_size) +theme(plot.title =element_text(face ="bold", size = base_size +1),plot.subtitle =element_text(color ="gray30", size = base_size -1),legend.position ="right",panel.grid.minor =element_blank(),axis.text =element_blank(),axis.title =element_blank() )}# Set as defaulttheme_set(theme_default())cat("✓ All packages loaded successfully!\n")```# Step 1: Getting the Data## 1.1 Load Chicago Spatial Data```{r}#| message: false#| warning: false#| results: hide#| # Load police districts (used for spatial cross-validation)policeDistricts <-st_read("https://data.cityofchicago.org/api/geospatial/24zt-jpfn?method=export&format=GeoJSON") %>%st_transform('ESRI:102271') %>% dplyr::select(District = dist_num)# Load police beats (smaller administrative units)policeBeats <-st_read("https://data.cityofchicago.org/api/geospatial/n9it-hstw?method=export&format=GeoJSON") %>%st_transform('ESRI:102271') %>% dplyr::select(Beat = beat_num)# Load Chicago boundarychicagoBoundary <-st_read("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/Chapter5/chicagoBoundary.geojson") %>%st_transform('ESRI:102271')cat(" - Police districts:", nrow(policeDistricts), "\n")cat(" - Police beats:", nrow(policeBeats), "\n")```## 1.2 Load 311 Rodent Baiting Data```{r}#| message: false#| warning: false#| results: hiderb <-read_csv("data/311_Service_Requests_-_Rodent_Baiting_-_Historical_20251114.csv")head(rb)rb_sf <- rb %>%filter(!is.na(Latitude), !is.na(Longitude)) %>%st_as_sf(coords =c("Longitude", "Latitude"), crs =4326) %>%st_transform("ESRI:102271") rb_sf <- rb_sf %>%filter(st_within(., chicagoBoundary, sparse =FALSE))# filter 2017 datarb_sf$CreationDate <-as.Date(rb_sf$`Creation Date`, format ="%m/%d/%Y")rb_sf_2017 <- rb_sf %>%filter(format(CreationDate, "%Y") =="2017")```## 1.3 Load Burglary Data```{r}#| message: false#| warning: false#| results: hide#| # Load from provided data file (downloaded from Chicago open data portal)burglaries <-st_read("data/burglaries.shp") %>%st_transform('ESRI:102271')# Check the datacat(" - Number of burglaries:", nrow(burglaries), "\n")cat(" - CRS:", st_crs(burglaries)$input, "\n")```## 1.4 Visualize Point Data### 1.4.1 Rodent Baiting Data```{r}#| fig-width: 10#| fig-height: 5# Simple point map for Rodent Baitingp1 <-ggplot() +geom_sf(data = chicagoBoundary, fill ="gray95", color ="gray60") +geom_sf(data = rb_sf_2017, color ="#d62828", size =0.1, alpha =0.4) +labs(title ="Rodent Baiting 311 Service Requests",subtitle =paste0("Chicago, n = ", nrow(rb_sf_2017)) )# Density surface using 311 Rodent Baiting datap2 <-ggplot() +geom_sf(data = chicagoBoundary, fill ="gray95", color ="gray60") +geom_density_2d_filled(data =data.frame(st_coordinates(rb_sf_2017)),aes(X, Y),alpha =0.7,bins =8 ) +scale_fill_viridis_d(option ="plasma",direction =-1,guide ="none" ) +labs(title ="Density Surface",subtitle ="Kernel density estimation of Rodent Baiting Requests" )# Combine plots using patchworkp1 + p2 +plot_annotation(title ="Spatial Distribution of Rodent Baiting 311 Requests in Chicago",tag_levels ="A" )```- The maps above show the spatial distribution of Rodent Baiting 311 requests across Chicago.\- The point map on the left illustrates the raw locations of all reported rodent issues. The large number of points makes the overall pattern dense, but we can still see clear clusters in the northern and western parts of the city.\- The density map on the right provides a smoother view of these patterns. Several strong hotspots emerge, especially in the Northwest Side and Near North areas. In contrast, the southern and far southwestern parts of Chicago show much lower intensity.### 1.4.2 Burgalaries Data```{r}# Simple point map for Burglaries (P3)p3 <-ggplot() +geom_sf(data = chicagoBoundary, fill ="gray95", color ="gray60") +geom_sf(data = burglaries, color ="#003f5c", size =0.15, alpha =0.5) +labs(title ="Residential Burglaries",subtitle =paste0("Chicago, n = ", nrow(burglaries)) )# Density surface for Burglaries (P4)p4 <-ggplot() +geom_sf(data = chicagoBoundary, fill ="gray95", color ="gray60") +geom_density_2d_filled(data =data.frame(st_coordinates(burglaries)),aes(X, Y),alpha =0.7,bins =8 ) +scale_fill_viridis_d(option ="plasma",direction =-1,guide ="none" ) +labs(title ="Density Surface",subtitle ="Kernel density estimation of Residential Burglaries" )# Combine P3 & P4p3 + p4 +plot_annotation(title ="Spatial Distribution of Residential Burglaries in Chicago",tag_levels ="A" )```- The maps show where residential burglaries were concentrated in Chicago.- The darker purple areas represent the highest concentrations of burglaries. These hot spots appear mainly in the north and south-central parts of the city.- Overall, the maps suggest that residential burglaries cluster in specific neighborhoods rather than occurring randomly across Chicago.## 1.5 Chapter Summary- **Load maps and boundaries of Chicago:** - This chapter loads police districts, police beats, and the city boundary.\ - These layers help define the study area and later allow spatial analysis.\ - All maps are changed into the same coordinate system so they line up correctly.- **Load 311 Rodent Baiting data:** - This chapter loads Chicago 311 Rodent Baiting data, and keeps only records from 2017.- **Load burglary data:** - A shapefile of burglary cases is loaded.\ - It is also transformed into the same coordinate system.\ - This dataset will be compared with the rodent data later.- **Purpose of Step 1:** - Gather most of the data needed for the project.\ - Clean and filter the data so it is accurate and consistent.\ - Make basic maps to understand where events happen.# Step 2: Fishnet Grid Creation## 2.1 Create Fishnet Grid```{r}# Create 500m x 500m gridfishnet <-st_make_grid( chicagoBoundary,cellsize =500, # 500 meters per cellsquare =TRUE) %>%st_sf() %>%mutate(uniqueID =row_number())# Keep only cells that intersect Chicagofishnet <- fishnet[chicagoBoundary, ]# View basic infocat(" - Number of cells:", nrow(fishnet), "\n")cat(" - Cell size:", 500, "x", 500, "meters\n")cat(" - Cell area:", round(st_area(fishnet[1,])), "square meters\n")```## 2.2 Aggregate data to Grid```{r}# Spatial join: which cell contains each rodent baiting request/burglaries?rodent_fishnet <-st_join(rb_sf_2017, fishnet, join = st_within) %>%st_drop_geometry() %>%group_by(uniqueID) %>%summarise(countRodent =n(), .groups ="drop")burglary_fishnet <-st_join(burglaries, fishnet, join = st_within) %>%st_drop_geometry() %>%group_by(uniqueID) %>%summarise(countBurglaries =n(), .groups ="drop")# Join back to fishnet (cells with 0 requests will be NA)fishnet <- fishnet %>%left_join(rodent_fishnet, by ="uniqueID") %>%left_join(burglary_fishnet, by ="uniqueID") %>%mutate(countRodent = tidyr::replace_na(countRodent, 0),countBurglaries = tidyr::replace_na(countBurglaries, 0) )```- Summary statistics for Rodent baiting request```{r}print(summary(fishnet$countRodent))cat("\nCells with zero rodent requests:",sum(fishnet$countRodent ==0), "/",nrow(fishnet), "(",round(100*sum(fishnet$countRodent ==0) /nrow(fishnet), 1), "% )\n")```- Summary statistics for Burglary```{r}print(summary(fishnet$countBurglaries))cat("\nCells with zero burglaries:",sum(fishnet$countBurglaries ==0), "/",nrow(fishnet), "(",round(100*sum(fishnet$countBurglaries ==0) /nrow(fishnet), 1), "% )\n")```## 2.3 Visualize Rodent Baiting with fishnet```{r}ggplot() +geom_sf(data = fishnet, aes(fill = countRodent), color =NA) +geom_sf(data = chicagoBoundary, fill =NA, color ="white", linewidth =1) +scale_fill_viridis_c(name ="Rodent Baiting Requests",option ="plasma",trans ="sqrt",breaks =c(0, 5, 10, 20, 40, 80, 180),limits =c(0, 180), oob = scales::squish ) +labs(title ="Rodent Baiting 311 Requests by Grid Cell",subtitle ="500m x 500m cells, Chicago,2017" ) +theme_default()quant_rb <-tibble(Percentile =c("0%", "10%", "20%", "30%", "40%", "50%", "60%", "70%", "80%", "90%", "100%"),Value =quantile(fishnet$countRodent, probs =seq(0,1,0.1)))knitr::kable( quant_rb,caption ="Quantile Distribution of Rodent Baiting 311 Request Counts")```- Most grid cells have very low request counts. According to the quantile table, 30% of the cells have 3 or fewer requests, and 50% have 11 or fewer.\- This pattern shows a strong **right-skewed distribution**, where a small number of hotspots generate many more requests than the rest of the city.\- On the map, hotspots appear mainly in the **central and north neighborhoods**, forming clear clusters of higher counts, and suggesting strong spatial heterogeneity.## 2.4 Visualize Burgalaries with fishnet```{r}# Burglary 500m x 500m grid mapggplot() +geom_sf(data = fishnet, aes(fill = countBurglaries), color =NA) +geom_sf(data = chicagoBoundary, fill =NA, color ="white", linewidth =1) +scale_fill_viridis_c(name ="Residential Burglaries",option ="plasma",trans ="sqrt",breaks =c(0, 1, 2, 3, 5, 8, 40),limits =c(0, 40),oob = scales::squish ) +labs(title ="Residential Burglaries by Grid Cell",subtitle ="500m x 500m cells, Chicago, 2017" ) +theme_default()quant_burglary <-tibble(Percentile =c("0%", "10%", "20%", "30%", "40%", "50%", "60%", "70%", "80%", "90%", "100%"),Value =quantile(fishnet$countBurglaries, probs =seq(0, 1, 0.1), na.rm =TRUE))knitr::kable( quant_burglary,caption ="Quantile Distribution of Residential Burglary Counts")```- Most grid cells have **very few burglaries**. From the quantiles, 50% of all cells have 2 or fewer burglary cases.\- The distribution is **right-skewed**. Many places have low values (0–3), while only a few places have high values.\- On the map, brighter colors (yellow) show where burglaries are concentrated. These hotspots appear mainly in the northern and southern parts of Chicago.\- The pattern suggests burglary risk is **not uniform** across the city. Instead, it clusters in specific neighborhoods, which is important for modeling and prediction.## 2.5 Chapter Summary- **Creating the Fishnet Grid:** - This chapter created a 500m × 500m grid that covers the whole Chicago boundary.\ - Each grid cell gets a unique ID, which helps match spatial features later.\ - The grid makes it easier to compare different locations using the same unit of space.- **Aggregating Rodent and Burglary Data:** - Each 311 rodent request and burglary point is assigned to a grid cell using a spatial join.\ - Then, this chapter counted how many events happen in each cell.\ - Cells with no events are filled with zero, so the dataset stays complete.\ - This step converts point data into count data, which is needed for statistical modeling.- **Mapping Rodent Baiting Counts:** - The map shows clear hotspots where rodent activity is high.\ - The distribution is very skewed, meaning most cells have low values, but a few cells have extremely high values.\ - This pattern suggests strong spatial clustering.- **Mapping Burglary Counts:** - Burglary counts also show uneven spatial distribution.\ - Many cells have 0–3 burglaries, while only a small number of cells have high counts.\ - This map confirms that burglaries are not random but cluster in specific neighborhoods.- **Overall Importance of Step 2:** - This chapter transforms raw point data into a structured spatial dataset. - It creates the base layer (the grid) where all future variables will be attached. - It also reveals important spatial patterns that guide later modeling steps.# Step 3: Spatial Features## 3.1 k-nearest neighbor```{r}# 1. Get centroid coordinatesfishnet_coords <-st_coordinates(st_centroid(fishnet))# 2. Compute nearest 8 neighbors (to other grid cells)knn_result <-get.knn(fishnet_coords, k =8)# 3. Add KNN features to fishnetfishnet <- fishnet %>%mutate(knn_mean =rowMeans(matrix(countRodent[knn_result$nn.index], nrow =nrow(fishnet))),knn_sum =rowSums(matrix(countRodent[knn_result$nn.index], nrow =nrow(fishnet))) )summary(fishnet$knn_mean)summary(fishnet$knn_sum)```- I calculated the **summary and mean values of neighboring grid cells** instead of the distance nearest event points. This method captures spatial spillover between neighborhoods better, providing stronger and more reliable predictive power.## 3.2 Distance to Hot Spots### 3.2.1 Perform Local Moran's I analysis```{r}# Function to calculate Local Moran's Icalculate_local_morans <-function(data, variable, k =5) {# Create spatial weights based on k-nearest neighbors coords <-st_coordinates(st_centroid(data)) neighbors <-knn2nb(knearneigh(coords, k = k)) weights <-nb2listw(neighbors, style ="W", zero.policy =TRUE)# Calculate Local Moran's I local_moran <-localmoran(data[[variable]], weights)# Classify clusters mean_val <-mean(data[[variable]], na.rm =TRUE) data %>%mutate(local_i = local_moran[, 1],p_value = local_moran[, 5],is_significant = p_value <0.05,moran_class =case_when(!is_significant ~"Not Significant", local_i >0& .data[[variable]] > mean_val ~"High-High", local_i >0& .data[[variable]] <= mean_val ~"Low-Low", local_i <0& .data[[variable]] > mean_val ~"High-Low", local_i <0& .data[[variable]] <= mean_val ~"Low-High",TRUE~"Not Significant" ) )}# Apply to rodent baiting countsfishnet <-calculate_local_morans(fishnet, "countRodent", k =5)```### 3.2.2 Visualize Morans```{r}#| fig-width: 8#| fig-height: 6# Visualize Local Moran's I clusters for rodent baitingggplot() +geom_sf(data = fishnet, aes(fill = moran_class), color =NA ) +scale_fill_manual(values =c("High-High"="#d7191c", # hotspots"High-Low"="#fdae61","Low-High"="#abd9e9","Low-Low"="#2c7bb6", # cold spots"Not Significant"="gray90" ),name ="Cluster Type" ) +labs(title ="Local Moran's I: Rodent Baiting Clusters",subtitle ="High-High = Hot spots of rodent activity" ) +theme_default()```- **Why there is no "Low-Low" and "High-Low" clusters?** - The data shows a highly right-skewed distribution.\ - When high-value areas are clustered together, they are very likely to be identified as High-High (hotspot).\ - Low-value areas are everywhere, so "low + low + low" is too "normal" for the tests. As a result, it will not be marked as significantly "Low-Low", but rather "Not Significant". - In the reality, rodent activity is concentrated in a small number of hotspots, while low counts are widespread across the city.### 3.2.3 Distance to Hotspots```{r}# Get centroids of "High-High" cells (rodent hotspots)hotspots <- fishnet %>%filter(moran_class =="High-High") %>%st_centroid()# Calculate distance from each cell to nearest rodent hotspotif (nrow(hotspots) >0) { fishnet <- fishnet %>%mutate(dist_to_hotspot =as.numeric(st_distance(st_centroid(fishnet), hotspots %>%st_union()) ) )cat(" - Number of hot spot cells:", nrow(hotspots), "\n")} else { fishnet <- fishnet %>%mutate(dist_to_hotspot =0)cat("⚠ No significant rodent baiting hot spots found\n")}```## 3.3 Building Age### 3.3.1 Building Age Feature```{r}#| message: false#| warning: false# Read building databuilding_sf <-st_read("data/Building_Footprints_20251117.geojson") %>%# buildings_sf <- st_read(# "https://data.cityofchicago.org/resource/syp8-uezg.geojson",# quiet = TRUE# ) %>%# Clean and calculate the building years of 2017filter(!is.na(year_built)) %>%mutate(year_built =as.numeric(year_built)) %>%filter(year_built >0, year_built <=2017) %>%mutate(bldg_age_2017 =2017- year_built) %>%st_transform('ESRI:102271')bldg_by_grid <-st_join(fishnet, building_sf, join = st_intersects) %>%st_drop_geometry() %>%group_by(uniqueID) %>%summarize(n_buildings =sum(!is.na(bldg_age_2017)),mean_bldg_age =ifelse( n_buildings ==0, NA_real_,mean(bldg_age_2017, na.rm =TRUE) ),median_bldg_age =ifelse( n_buildings ==0, NA_real_,median(bldg_age_2017, na.rm =TRUE) ),.groups ="drop" )fishnet <- fishnet %>%left_join(bldg_by_grid, by ="uniqueID")```### 3.3.2 Visualize Building Age```{r}#| fig-width: 8#| fig-height: 6ggplot() +geom_sf(data = fishnet, aes(fill = mean_bldg_age), color =NA) +geom_sf(data = chicagoBoundary, fill =NA, color ="white", linewidth =1) +scale_fill_viridis_c(name ="Mean building age (years)",option ="magma",direction =-1 ,na.value ="grey90" ) +labs(title ="Average Building Age by Grid Cell",subtitle ="500m x 500m cells, Chicago, 2017" ) +theme_default()```- The map shows large differences in building age across Chicago.\- Newer buildings appear more often in the **central and far north** areas, where the colors are lighter.\- The pattern suggests that building age is not evenly distributed.Older structures cluster in certain neighborhoods, while others have more recent development.\- Many **burglary hotspots** appear in areas where **buildings are older**. - These locations, mainly in the north, and southeast parts of Chicago—show both high burglary counts and high building age.\ - Older buildings may have weaker physical security, such as outdated doors, windows, or locks.\ - Older neighborhoods may also have denser housing, which increases opportunities for burglary.## 3.4 Land Use### 3.4.1 Land Use Feature```{r}#| message: false#| warning: false# Read the zoning data# zoning_sf <- st_read(# "https://data.cityofchicago.org/resource/dj47-wfun.geojson",# quiet = TRUE# )zoning_sf <-st_read("data/Boundaries_-_Zoning_Districts_(current)_20251117.geojson")# Extract the letter parts of the first three characters from ZONE_CLASSzoning_sf <- zoning_sf %>%mutate(zone_prefix =substr(zone_class, 1, 3),zone_prefix =gsub("[^A-Za-z]", "", zone_prefix) ) %>%filter(zone_prefix !="") %>%mutate(zone_code = dplyr::recode( zone_prefix,"RS"="1","RT"="2","RM"="3","B"="4","C"="5","DC"="6","DR"="7","DS"="8","DX"="9","M"="10","PMD"="11","PD"="12","T"="13","POS"="14",.default =NA_character_ ) ) %>%filter(!is.na(zone_code)) %>%st_transform("ESRI:102271")zoning_sf$zone_code <-as.integer(zoning_sf$zone_code)# Calculate the modemode_first <-function(x) { x <- x[!is.na(x)]if (length(x) ==0) return(NA) ux <-unique(x) ux[which.max(tabulate(match(x, ux)))]}# Aggregate to fishnet: One zoning type for each gridzoning_by_grid <-st_join(fishnet, zoning_sf, join = st_intersects) %>%st_drop_geometry() %>%group_by(uniqueID) %>%summarize(zone_prefix_dom =mode_first(zone_prefix), zone_code_dom =mode_first(zone_code),.groups ="drop" )# join the zoning variable back to fishnetfishnet <- fishnet %>%left_join(zoning_by_grid, by ="uniqueID")```### 3.4.2 Visualize Land Use```{r}# Create a zoning classification with complete instructionsfishnet <- fishnet %>%mutate(zone_cat =factor( zone_prefix_dom,levels =c("RS", "RT", "RM", "B", "C", "DC", "DR", "DS", "DX","M", "PMD", "PD", "T", "POS"),labels =c("RS = Residential Single-Unit District","RT = Residential Two-Flat, Townhouse and Multi-Unit District","RM = Residential Multi-Unit District","B = Business","C = Commercial","DC = Downtown Core District","DR = Downtown Residential District","DS = Downtown Service District","DX = Downtown Mixed-Use District","M = Manufacturing","PMD = Planned Manufacturing Districts","PD = Planned Development","T = Transportation","POS = Parks and Open Space" ) ) )#| fig-width: 8#| fig-height: 6ggplot() +geom_sf(data = fishnet, aes(fill = zone_cat), color =NA) +geom_sf(data = chicagoBoundary, fill =NA, color ="white", linewidth =1) +scale_fill_viridis_d(name ="Zoning (dominant class)",option ="turbo", na.value ="grey90" ) +labs(title ="Dominant Zoning Category by Grid Cell",subtitle ="500m x 500m cells, Chicago" ) +theme_default() +theme(legend.key.height =unit(0.6, "cm"),legend.key.width =unit(0.6, "cm") )```- The map shows the dominant zoning type in each 500m × 500m grid cell in Chicago.\- The map shows Chicago has a mixed pattern, but **residential zones** dominate most of the city.\- Downtown districts (DR, DS, DX) are concentrated in the central east area.\- Manufacturing (M, PMD) zones cluster in the southeast, and middlewest.\- Comparing with burglary distribution, burglary hotspots occur mostly in **residential zones**. - This makes sense because burglary is a residential crime, so it appears mainly where people live.\- **Manufacturing zones** have low burglary counts. - These areas have fewer homes, more industrial facilities, and lower residential density.## 3.5 Chapter Summary- **k-Nearest Neighbor (KNN) Feature for 311 request feature:** - For every cell, this chapter found its 8 nearest neighboring cells.\ - Then this chapter calculated `knn_mean`, which is the average rodent request count in those neighbors, and `knn_sum`, which is the total rodent request count in those neighbors.- **Distance to Hot Spots Feature(Local Moran’s I):** - Each grid cell was classified into: `High–High (hotspot)`, `Low–Low`, `High–Low`, `Low–High`, or Not Significant. In practice, most significant clusters are High–High hotspots, while many low-count areas are “Not Significant” instead of Low–Low.\ - Then this chapter took the centroids of High–High cells and measured, for every grid cell, the distance to the nearest hotspot (`dist_to_hotspot`).\ - This creates a feature that shows how close each place is to intense rodent activity.- **Building Age Feature:** - Building data was loaded from `Chicago Data Portal`, named `Building Footprints`.\ - This chapter cleaned the data and kept only buildings built before or in 2017, then computed building age in 2017. - Then, for each grid cell, this chapter calculated: `n_buildings`,which is this number of buildings with valid age. `mean_bldg_age`, which is the average building age. `median_bldg_age`, which is the median building age.\ - These values were joined back to the fishnet, so every cell has information about its building stock and age structure.- **Land Use Feature:** - Zoning data was loaded from `Chicago Data Portal`, named `Boundaries - Zoning Districts (current)`.\ - This chapter aggregated the zoning data into the secondary classification of zoning classes.(reference: *https://secondcityzoning.org/zones/*)\ - For each cell, this chapter identified its function as its dominant zoning type.\ - A map of these categories shows how residential, commercial, downtown, industrial, and park areas are distributed across Chicago.- **Purpose of Step 3:** - Step 3 builds key spatial predictor variables on top of the grid: neighborhood intensity (`knn_mean`), proximity to rodent hotspots (`dist_to_hotspot`), built environment age (`mean_bldg_age`), and land use type (`zone_code_dom`).\ - They are later used in the Poisson and Negative Binomial regression models as main explanatory variables.# Step 4: Count Regression Models## 4.1 Poisson regression### 4.1.1 Fit Poisson regression```{r}#| label: poisson-reg#| message: false#| warning: false#| results: hidefishnet$zone_code_dom <-as.factor(fishnet$zone_code_dom)pois_model <-glm( countBurglaries ~ mean_bldg_age + zone_prefix_dom + dist_to_hotspot + knn_mean,data = fishnet,family ="poisson")summary(pois_model)```### 4.1.2 Check for Overdispersion```{r}#| label: poisson-dispersion-check#| message: false#| warning: falsedispersion_rb <-sum(residuals(pois_model, type ="pearson")^2) / pois_model$df.residualcat("Dispersion parameter for rodent baiting Poisson model:",round(dispersion_rb, 2), "\n")cat("Rule of thumb: values > 1.5 indicate overdispersion.\n")if (dispersion_rb >1.5) {cat("⚠ Overdispersion detected for rodent baiting counts.\n")cat(" The Poisson model is likely too restrictive.\n")cat(" A Negative Binomial model is more appropriate.\n")} else {cat("✓ Dispersion looks acceptable.\n")cat(" The Poisson model may be adequate for these counts.\n")}```## 4.2 Fit Negative Binomial regression```{r}#| label: nb-reg#| message: false#| warning: false#| results: hidenb_model <-glm.nb( countBurglaries ~ mean_bldg_age + zone_prefix_dom + dist_to_hotspot + knn_mean,data = fishnet)summary(nb_model)```## 4.3 Compare model fit (AIC)```{r}#| label: compare-aic#| message: false#| warning: falsemodel_compare <-data.frame(Model =c("Poisson", "Negative Binomial"),AIC =c(AIC(pois_model), AIC(nb_model)))knitr::kable(model_compare, caption ="Model Comparison: Poisson vs. NB")dispersion_row <-data.frame(Statistic ="Poisson Dispersion",Value = dispersion_rb)knitr::kable(dispersion_row, caption ="Poisson Dispersion Statistic")```## 4.4 Chapter Summary- In this chapter, I built statistical models to explain how burglary counts change with spatial features:\ building age, zoning, distance to rodent baiting counts hotspots, and mean values of rodent baiting counts in neighboring grid cells.\- Then, this chapter first fitted a Poisson regression with these variables and then checked the dispersion of the residuals. The **Poisson dispersion is about 3.47**, which is much larger than 1.5. This means the data are strongly overdispersed (variance much higher than the mean), so the Poisson model is too restrictive.\- To handle overdispersion, research then fitted a Negative Binomial (NB) regression with the same main predictors.This model allows **extra variance** and is better suited for these burglary counts.\- Poisson model has an **AIC of about 11849**, while the Negative Binomial model has an **AIC of about 9837**. A lower AIC means a better fit, so the **Negative Binomial model** performs much better.\- Overall, Step 4 shows that burglary counts are **highly overdispersed**, and that the Negative Binomial regression is the appropriate model for capturing the relationship between burglaries and the spatial features created in previous steps.# Step 5: Spatial Cross-Validation## 5.1 Leave-One-Group-Out cross-validation on 2017 data```{r}#| message: false#| warning: false#| results: hide# Calculate which police district the centroid of the grid falls infishnet_centroids <-st_centroid(fishnet) %>%st_join(policeDistricts, join = st_within) %>%st_drop_geometry() %>% dplyr::select(uniqueID, District)# Construct a dataset for cross-validationfishnet_model <- fishnet %>%left_join(fishnet_centroids, by ="uniqueID") %>%filter(!is.na(District)) %>%mutate(District =as.factor(District),zone_code_dom =as.factor(zone_code_dom) ) %>%filter(!is.na(countRodent),!is.na(mean_bldg_age),!is.na(zone_code_dom),!is.na(knn_mean),!is.na(dist_to_hotspot) )# LOGO CVdistricts <-unique(fishnet_model$District)cv_results <-tibble()for (i inseq_along(districts)) { test_district <- districts[i]# Divide the training set/test set train_data <- fishnet_model %>%filter(District != test_district) test_data <- fishnet_model %>%filter(District == test_district) train_data <- train_data %>%mutate(zone_code_dom =factor(zone_code_dom)) test_data <- test_data %>%mutate(zone_code_dom =factor(zone_code_dom,levels =levels(train_data$zone_code_dom)))# Fit the NB modelmodel_cv <- MASS::glm.nb( countBurglaries ~ mean_bldg_age + zone_code_dom + dist_to_hotspot + knn_mean,data = train_data )# Make predictions on the test police district test_data <- test_data %>%mutate(prediction =predict(model_cv, newdata = ., type ="response"),abs_error =abs(countRodent - prediction),sq_error = (countRodent - prediction)^2 ) test_eval <- test_data %>%filter(!is.na(prediction)) mae <-mean(test_eval$abs_error, na.rm =TRUE) rmse <-sqrt(mean(test_eval$sq_error, na.rm =TRUE))# Save the resultcv_results <-bind_rows( cv_results,tibble(fold = i,test_district =as.character(test_district),n_test =nrow(test_eval),mae = mae,rmse = rmse ) )cat(" Fold", i, "/", length(districts),"- District", as.character(test_district),"- n_test:", nrow(test_eval),"- MAE:", round(mae, 2),"- RMSE:", round(rmse, 2), "\n")}```## 5.2 Calculate and report error metrics (MAE, RMSE)```{r}cv_sorted <- cv_results %>%arrange(fold)# Table 1: local MAE and RMSEcv_sorted %>% knitr::kable(digits =2,caption ="Leave-One-District-Out CV Results (per district)" ) %>% kableExtra::kable_styling(bootstrap_options =c("striped", "hover"))# Table 2: Average MAE and RMSecv_summary <- cv_sorted %>%summarise(mean_MAE =mean(mae, na.rm =TRUE),mean_RMSE =mean(rmse, na.rm =TRUE),n_folds = dplyr::n() )cv_summary %>% knitr::kable(digits =2,caption ="Average MAE and RMSE across all districts (LOGO CV)" ) %>% kableExtra::kable_styling(bootstrap_options =c("striped", "hover"))```## 5.3 Visualize error metrics```{r}cv_by_district <- cv_results %>%group_by(test_district) %>%summarise(mae =mean(mae, na.rm =TRUE),rmse =mean(rmse, na.rm =TRUE),.groups ="drop" )# Merge with the police district base mappolice_cv <- policeDistricts %>%mutate(District =as.character(District)) %>%left_join(cv_by_district, by =c("District"="test_district"))# RMSE spatial distributionggplot(police_cv) +geom_sf(aes(fill = rmse)) +labs(title ="LOGO CV RMSE by Police District",fill ="RMSE" ) +theme_minimal()# MAE spatial distributionggplot(police_cv) +geom_sf(aes(fill = mae)) +labs(title ="LOGO CV MAE by Police District",fill ="MAE" ) +theme_minimal()```- Prediction errors **vary strongly** across the city. - Both RMSE and MAE maps show that some police districts have much higher prediction errors than others.\ - The lightest areas (middle and far-north districts) have RMSE \> 60 and MAE \> 50, meaning the model struggles most in these regions.- **High-error** districts overlap with **low burglary** areas - The earlier burglary map showed that the northern parts of Chicago have very low burglary counts. These same areas now show the highest prediction errors.## 5.4 Chapter Summary- **Leave-One-District-Out Cross-Validation** - The resaerch used **police districts** as spatial groups for cross-validation.\ - For each fold: Took one district as the test set, then used all the other districts as the training set. Fitted a Negative Binomial model with the same predictors as before: `mean_bldg_age`, `zone_code_dom`, `dist_to_hotspot`, and `knn_mean`. After that, Predicted burglary counts for the test district and calculated absolute error and squared error for each grid cell.\ - This process was repeated for **every police district**, so each district was left out once.- **Error Metrics (MAE and RMSE)** - MAE (Mean Absolute Error): average size of errors.\ - RMSE (Root Mean Squared Error): penalizes large errors more strongly.- **Mapping the Errors** - These maps show that prediction accuracy is **not uniform** across the city: Some districts (**especially in the north**) have higher MAE/RMSE, meaning the model struggles there. Districts in the **central and southern hotspot areas** tend to have lower errors, where burglary patterns are stronger and easier to learn.- **Purpose of Step 5** - This step tests the model in a realistic spatial way: it predicts for entire **districts that were never seen in training**. - It reveals where the model is reliable and where it is weak, instead of only reporting a single global fit statistic. - The results show that the Negative Binomial model captures burglary patterns better in high-crime, hotspot districts, but has larger uncertainty in low-crime districts.# Step 6: Model Evaluation## 6.1 KDE baseline```{r}# 1. Extract XY coordinates of burglary pointsbur_xy <-st_coordinates(burglaries)# 2. Create a bounding window from the Chicago boundarywin_bbox <- spatstat.geom::owin(xrange =st_bbox(chicagoBoundary)[c("xmin", "xmax")],yrange =st_bbox(chicagoBoundary)[c("ymin", "ymax")])# 3. Create a ppp object (point pattern) for burglariesbur_ppp <- spatstat.geom::ppp(x = bur_xy[, 1],y = bur_xy[, 2],window = win_bbox)# 4. Kernel density estimate for the point pattern# IMPORTANT: call density.ppp(bur_ppp, ...) — no "X ="bur_kde <- spatstat.explore::density.ppp( bur_ppp,sigma =1000# bandwidth in meters)# 5. Convert KDE to raster and extract values at grid centroidsbur_kde_raster <- raster::raster(bur_kde)fishnet_centroids <-st_coordinates(st_centroid(fishnet))fishnet$kde_value <- raster::extract( bur_kde_raster, fishnet_centroids)# 6. Replace NA values (outside the KDE window) with 0fishnet$kde_value[is.na(fishnet$kde_value)] <-0```## 6.2 Final model + predictions```{r}# Fit the final NB model using all available training cellsfinal_model <-glm.nb( countBurglaries ~ mean_bldg_age + knn_mean + dist_to_hotspot + zone_prefix_dom,data = fishnet_model)# Add NB predictions back to fishnet (matched by uniqueID)fishnet <- fishnet %>%mutate(prediction_nb =predict( final_model,newdata = fishnet_model,type ="response" )[match(uniqueID, fishnet_model$uniqueID)] )# Normalize KDE predictions to total burglary countkde_sum <-sum(fishnet$kde_value, na.rm =TRUE)count_sum <-sum(fishnet$countBurglaries, na.rm =TRUE)fishnet <- fishnet %>%mutate(prediction_kde = (kde_value / kde_sum) * count_sum )fishnet <- fishnet %>%mutate(error_nb = countBurglaries - prediction_nb,abs_error_nb =abs(error_nb),error_kde = countBurglaries - prediction_kde,abs_error_kde =abs(error_kde) )```## 6.3 Compare Actual vs NB vs KDE```{r}#| fig-width: 12#| fig-height: 4# Actual countsp1 <-ggplot() +geom_sf(data = fishnet, aes(fill = countBurglaries), color =NA) +scale_fill_viridis_c(name ="Count", option ="plasma", limits =c(0, 15) ) +labs(title ="Actual Burglaries (2017)") +theme_default()# Negative Binomial predictionsp2 <-ggplot() +geom_sf(data = fishnet, aes(fill = prediction_nb), color =NA) +scale_fill_viridis_c(name ="Predicted", option ="plasma", limits =c(0, 15) ) +labs(title ="NB Model Predictions") +theme_default()# KDE baseline predictionsp3 <-ggplot() +geom_sf(data = fishnet, aes(fill = prediction_kde), color =NA) +scale_fill_viridis_c(name ="Predicted", option ="plasma", limits =c(0, 15) ) +labs(title ="KDE Baseline Predictions") +theme_default()# Combine the three mapsp1 + p2 + p3 +plot_annotation(title ="Actual vs Predicted Burglaries",subtitle ="Comparing NB Model and KDE Baseline" )```- All three maps show the same broad structure: higher burglary levels in the **middle north and middle south** parts of Chicago, and much lower activity in the **middle, far north and far south** part of the city.- The actual burglary map shows sharp, uneven hotspots. The NB model captures the broad hotspot regions and major regional contrasts, but **produces smoother predictions** because it relies on **explanatory variables** such as building age, zoning, and hotspot distance.\- The KDE baseline looks visually similar to the actual map, but this is expected, as KDE **simply smooths the observed data** without any modeling of underlying urban processes.## 6.4 Model performance comparison (MAE / RMSE)```{r}# Calculate MAE and RMSE for NB model and KDE baselinecomparison <- fishnet %>%st_drop_geometry() %>%filter(!is.na(prediction_nb), !is.na(prediction_kde)) %>%summarize(model_mae =mean(abs(countBurglaries - prediction_nb)),model_rmse =sqrt(mean((countBurglaries - prediction_nb)^2)),kde_mae =mean(abs(countBurglaries - prediction_kde)),kde_rmse =sqrt(mean((countBurglaries - prediction_kde)^2)) )# Reshape and print the tablecomparison %>%pivot_longer(everything(), names_to ="metric", values_to ="value" ) %>%separate(metric, into =c("approach", "metric"), sep ="_") %>%pivot_wider(names_from = metric, values_from = value) %>%kable(digits =2,caption ="Model Performance: NB vs KDE" ) %>%kable_styling(bootstrap_options =c("striped", "hover"))```- KDE achieves lower MAE and RMSE because it **directly smooths the observed burglary events**, essentially reproducing the spatial pattern of the actual data. This makes KDE a **strong spatial baseline** but not a predictive or explanatory model.\- In contrast, the Negative Binomial model aims to **explain burglary variation using built-environment and zoning characteristics**, rather than simply interpolating past events. Therefore, its higher error values are expected and do not indicate model failure. Instead, the NB model provides interpretability and generalization ability that KDE cannot offer.## 6.5 Spatial Distribution of Prediction Errors```{r}#| fig-width: 12#| fig-height: 8# Signed error (NB)p_nb_err <-ggplot() +geom_sf(data = fishnet, aes(fill = error_nb), color =NA) +scale_fill_gradient2(name ="Error",low ="#2166ac",mid ="white",high ="#b2182b",midpoint =0,limits =c(-10, 10) ) +labs(title ="NB Model: Signed Error") +theme_default()# Absolute error (NB)p_nb_abs <-ggplot() +geom_sf(data = fishnet, aes(fill = abs_error_nb), color =NA) +scale_fill_viridis_c(name ="Abs. Error",option ="magma" ) +labs(title ="NB Model: Absolute Error") +theme_default()# Signed error (KDE)p_kde_err <-ggplot() +geom_sf(data = fishnet, aes(fill = error_kde), color =NA) +scale_fill_gradient2(name ="Error",low ="#2166ac",mid ="white",high ="#b2182b",midpoint =0,limits =c(-10, 10) ) +labs(title ="KDE Baseline: Signed Error") +theme_default()# Absolute error (KDE)p_kde_abs <-ggplot() +geom_sf(data = fishnet, aes(fill = abs_error_kde), color =NA) +scale_fill_viridis_c(name ="Abs. Error",option ="magma" ) +labs(title ="KDE Baseline: Absolute Error") +theme_default()# Combine 4 plots in 2×2 layout(p_nb_err + p_nb_abs) /(p_kde_err + p_kde_abs)```- **NB Model: Signed Error(Top-Left)** - Under-prediction happens more in major hotspot regions (especially central–south), while over-prediction appears more in low-crime northern areas.\ - This reflects the NB model's reliance on **built-environment variables**, which smooth the predictions and do not fully capture local spikes.- **NB Model: Absolute Error(Top-Right)** - Absolute errors are larger mainly in districts with very high or very low burglary counts.\ - High-crime areas in the south show moderate errors but not extreme ones.\ - The NB model handles high-crime zones better because built-environment variables explain these areas more consistently.- **KDE Baseline: Signed Error(Bottom-Left)** - KDE errors are much more random and noisy.\ - The spatial error pattern lacks structure; KDE simply smooths nearby crime counts.\ - This randomness reflects that **KDE is not a model**. It only uses spatial proximity and does not incorporate zoning or urban characteristics.- **KDE Baseline: Absolute Error(Bottom-Right)** - Absolute errors are generally smaller than NB (expected), because KDE is effectively a **smoothed version of the real data**.- NB shows more structured errors because it predicts using **explanatory variables**. This is not a failure, it's evidence that the **NB model is capturing urban mechanisms**, not just smoothing past data.## 6.6 Chapter Summary- **KDE Baseline:** - This chapter built a KDE (Kernel Density Estimation) surface using the burglary point pattern.\ - KDE acts as a **spatial baseline**: it uses only the locations of past events, without any built-environment variables.- **Final NB Model and Predictions:** - In this chapter, I fitted a final Negative Binomial model using all training cells from `fishnet_model`, including `mean_bldg_age`, `knn_mean`, `dist_to_hotspot`, `zone_prefix_dom`.\ - Research generated NB predictions for each grid cell and stored them as `prediction_nb`. Meanwhile, this chapter used the KDE values to create a KDE prediction `prediction_kde` for each cell.\ - For both approaches, compute signed error and absolute error relative to the actual burglary counts.- **Overall Interpretation of Step 6:** - Step 6 evaluates whether the Negative Binomial model provides reasonable and useful predictions compared to a strong spatial baseline (KDE).\ - Results show that: - KDE achieves slightly lower MAE/RMSE because it effectively reproduces the observed pattern by smoothing the data.\ - The NB model offers **interpretable, mechanism-based predictions**, linking burglary risk to building age, zoning, rodent baiting requests hotspot distance, and its neighborhood feature, even though its errors are somewhat higher.\ - Therefore, KDE is useful as a **benchmark for spatial clustering**, while the NB model is more valuable for understanding and explaining where and why burglaries concentrate, and for generalizing to new times or scenarios.# Step7: Summary Statistics and Tables## 7.1 Model Summary Table```{r}# Lookup full zoning labelszoning_labels <-c("RS"="Residential Single-Unit District","RT"="Residential Two-Flat, Townhouse and Multi-Unit District","RM"="Residential Multi-Unit District","B"="Business","C"="Commercial","DC"="Downtown Core District","DR"="Downtown Residential District","DS"="Downtown Service District","DX"="Downtown Mixed-Use District","M"="Manufacturing","PMD"="Planned Manufacturing Districts","PD"="Planned Development","T"="Transportation","POS"="Parks and Open Space")model_summary <- broom::tidy(final_model, exponentiate =TRUE) %>%mutate(term = dplyr::case_when( term =="(Intercept)"~"Intercept", term =="mean_bldg_age"~"Mean building age", term =="knn_mean"~"KNN mean (neighboring rodent baiting requests)", term =="dist_to_hotspot"~"Distance to rodent hotspot",# Zoning coefficients: zone_prefix_domRS, zone_prefix_domRTgrepl("^zone_prefix_dom", term) ~paste0("Zoning: ",sub("zone_prefix_dom", "", term), " (", zoning_labels[sub("zone_prefix_dom", "", term)],")" ),TRUE~ term ),# Round all numeric columns dplyr::across(where(is.numeric), ~round(.x, 3)) )model_summary %>%kable(caption ="Final Negative Binomial Model Coefficients (Exponentiated Rate Ratios)",col.names =c("Variable", "Rate Ratio", "Std. Error", "Z", "P-Value") ) %>%kable_styling(bootstrap_options =c("striped", "hover")) %>%footnote(general =paste("Rate ratios > 1 indicate a positive association with burglary counts,","while rate ratios < 1 indicate a negative association.","Zoning (land use) is a categorical variable; each zoning category is interpreted","relative to the reference category (the omitted baseline level of zone_prefix_dom)." ) )```- **Mean building age** has a strong and significant positive effect. For each extra year of average building age in a grid cell, the expected burglary count increases by about 2%, holding other variables constant.- **Manufacturing (M) and Planned Manufacturing Districts (PMD)** both have p \< 0.001, and their Rate Ratio both lower than 1, meaning that burglary risk in industrial zones is much lower than in the reference zoning category.- **Planned development areas(PD)**(RR ≈ 0.80, p ≈ 0.015) also show lower burglary risk.- **Residential Single-Unit (RS)**(RR ≈ 0.87, p ≈ 0.013) have somewhat lower burglary counts than the reference zoning.- Some downtown categories (DR, DS) have extreme rate ratios and huge standard errors, likely because there are very few grid cells of these types, so their coefficients should be interpreted with caution.## 7.2 Discussion and LimitationsThis study shows that the NB model can capture some broad burglary patterns, such as higher risk in areas with older buildings and lower risk in manufacturing zones. However, the model also has clear limitations. Some of the variables I included—such as rodent KNN and distance to rodent hotspots—were chosen somewhat loosely and do not strongly relate to burglary in the results. Because of this, the model cannot explain the sharp hotspots seen in the actual data. Overall, while the NB model offers some useful insights, its performance is restricted by both the choice of variables and the model structure, and future work should include more relevant predictors and more advanced spatial methods.